Abstract

A filtration of a formal language L by a sequence s maps L to the set of words 
formed by taking the letters of words of L indexed only by s. We consider the languages 
resulting from filtering by all arithmetic progressions. If L is regular, it is easy to see 
that only finitely many distinct languages result. By contrast, there exist CFL's that 
give infinitely many distinct languages as a result. We use our technique to show that 
the operation diag, which extracts the diagonal of words of square length arranged in 
a square array, preserves regularity but does not preserve context-freeness. 

1 Introduction 

Let $s = (s(i))_{i \geq 0}$ be an infinite strictly increasing sequence of non-negative integers. Berstel
et al. [1] introduced the notion of filtering by s: given a finite word $w = a_0 a_1 \cdots a_n$, we write
$w[s] = a_{s(0)} a_{s(1)} \cdots a_{s(k)}$, where k is the largest integer such that $s(k) \leq n < s(k+1)$. (If there
is no such integer, then $w[s] = \epsilon$.) Given a language L, we define $L[s] = \{w[s] : w \in L\}$.

Example 1. If w = theorem, and s = 0,2,4,6,..., the sequence of even integers, then 
w[s] = term. If t = 1, 3, 5, . . ., the sequence of odd integers, then w[t] = hoe. 
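
The definition of w[s] can be illustrated with a minimal Python sketch. The function name `filter_word` and the representation of s as an iterable of indices are assumptions made for illustration only; they are not notation from [1].

```python
from itertools import count

def filter_word(w, s):
    """Return w[s]: the letters of w at the indices listed by s.

    s is any iterable of strictly increasing non-negative integers
    (possibly infinite); iteration stops at the first index past the end of w.
    """
    out = []
    for idx in s:
        if idx >= len(w):
            break
        out.append(w[idx])
    return "".join(out)

print(filter_word("theorem", count(0, 2)))  # even indices: term
print(filter_word("theorem", count(1, 2)))  # odd indices: hoe
```

If no index of s falls inside w, the loop exits immediately and the empty word is returned, matching the convention $w[s] = \epsilon$.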

Berstel et al. [1] proved a number of theorems about filters, and characterized those
sequences s that preserve regularity (i.e., L[s] is always regular if L is) and context-freeness. 

In this note we revisit the concept of filtering from a slightly different point of view. 
Suppose we have an infinite set of filters $S = \{s_1, s_2, \ldots\}$. Given a language L, what can be
said about the set of all filtered languages $\{L[s_i] : i \geq 1\}$? For example, is it finite?



In this note we are only concerned with filters s that represent arithmetic progressions: 
there exist integers $a \geq 1$, $b \geq 0$ such that $s_i = ai + b$ for $i \geq 0$. We consider four different
types of filter sets: 

(a) $a \geq 1$ and $b = 0$: the weak arithmetic progressions

(b) $a \geq 1$ and $0 \leq b < a$: the ordinary arithmetic progressions

(c) $a \geq 1$ and $b \geq 0$: the strong arithmetic progressions

(d) $a = 1$ and $b \geq 0$: the shifts

If L is regular, a simple argument (given below) shows that filtration by the strong 
arithmetic progressions produces only finitely many distinct languages (and hence the same is 
true for filtration by the weak and ordinary arithmetic progressions and shifts). By contrast, 
there exist context-free languages L so that filtering only by the weak arithmetic progressions 
or the shifts produces infinitely many distinct languages (and hence the same is true for the 
ordinary and strong arithmetic progressions). 

In Section 4 we introduce a natural operation on formal languages that is related to the
results of Berstel et al. [1], but seemingly cannot be analyzed using their framework. We
show that this operation preserves regularity, but does not preserve context-freeness. 

We adopt the following notation: if L is a language, and $s = (s_i)_{i \geq 0}$ is an arithmetic
progression such that $s_i = ai + b$, then we define $L_{a,b} := L[s]$. Similarly, if w is a word, we
define $w_{a,b} := w[s]$.
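
For arithmetic progressions, filtering a word amounts to taking every a-th letter starting at index b, which is exactly an extended slice in Python. The sketch below is illustrative only: the names `filter_ap` and `filter_language` are hypothetical, and a finite set of words stands in for the (generally infinite) language L.

```python
def filter_ap(w, a, b):
    """Return w_{a,b}: the letters of w at indices b, a+b, 2a+b, ..."""
    if a < 1 or b < 0:
        raise ValueError("need a >= 1 and b >= 0")
    return w[b::a]

def filter_language(words, a, b):
    """Return L_{a,b} for a finite set of words standing in for L."""
    return {filter_ap(w, a, b) for w in words}

sample = {"theorem", "lemma", "proof"}
print(sorted(filter_language(sample, 2, 0)))  # weak progression a = 2, b = 0
print(sorted(filter_language(sample, 1, 3)))  # shift by b = 3
```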

2 The regular case 

Theorem 2. If L is regular, then filtering by the strong arithmetic progressions produces 
finitely many distinct languages. 

Remark 3. It is easy to see that if L is regular and s is an arithmetic progression, then L[s] 
is regular. Indeed, this follows immediately from the theorem that the regular languages are 
closed under applying a transducer, since it is easy to make a transducer that extracts the 
letters corresponding to indices in s. That is not the issue here; we need to see that among 
all the regular languages produced by filtering by a strong arithmetic progression, there are 
only finitely many distinct languages. 

Proof. Let $A = (Q, \Sigma, \delta, q_0, F)$ be a DFA accepting L. Our proof is based on the boolean
matrix interpretation of automata [3]. Let $M_c$ be the boolean incidence matrix of the underlying
transition graph of the automaton corresponding to a transition on the symbol $c \in \Sigma$.
That is, if $Q = \{q_0, q_1, \ldots, q_{n-1}\}$, then

\[ (M_c)_{i,j} = \begin{cases} 1, & \text{if } \delta(q_i, c) = q_j; \\ 0, & \text{otherwise.} \end{cases} \]
We also write $M = \bigvee_{c \in \Sigma} M_c$. By standard results about path algebra, the matrix $M^n$ has a
1 in row i and column j if and only if there is a length-n path from $q_i$ to $q_j$.

Suppose L = L(A). We show how to create a DFA $A' = (Q', \Sigma, \delta', q'_0, F')$ accepting $L_{a,b}$.
The idea is that $w = c_0 c_1 \cdots c_{n-1}$ should be accepted if and only if there exists a word $x \in L$
such that

\[ x = x_0 c_0 x_1 c_1 \cdots x_{n-1} c_{n-1} x_n, \]

where $x_0, x_1, \ldots, x_n$ are words such that $|x_0| = b$, $|x_i| = a - 1$ for $1 \leq i < n$, and $|x_n| < a$.

The state set is $Q' = \{q'_0\} \cup \{0, 1\}^n$. Thus all states except $q'_0$ are boolean vectors. We
let f be a boolean vector with 1's in the positions corresponding to the final states in F.

We define the transition function $\delta'$ as follows:

\[ \delta'(q'_0, c) = [1\ \overbrace{0\ 0 \cdots 0}^{n-1}]\, M^b M_c; \]
\[ \delta'(q, c) = q\, M^{a-1} M_c, \]

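for all boolean vectors q and symbols $c \in \Sigma$. Also define

\[ T = \{q : \text{there exists } i,\ 0 \leq i < a, \text{ such that } q \cdot M^i \cdot f = 1\}. \]

Finally, set

\[ F' = \begin{cases} T \cup \{q'_0\}, & \text{if } L \text{ contains a word of length} \leq b; \\ T, & \text{otherwise.} \end{cases} \]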



An easy induction on n now shows that if $\delta'(q'_0, c_0 c_1 \cdots c_{n-1}) = v$, then v has 1's in the
positions corresponding to all states of the form $\delta(q_0, x_0 c_0 \cdots x_{n-1} c_{n-1})$, where the words $x_i$
satisfy the length conditions mentioned previously. It follows that $L(A') = L_{a,b}$.

Note that A' has $2^n + 1$ states, and this quantity does not depend on a or b. There are
only finitely many languages over $\Sigma$ accepted by a DFA with $2^n + 1$ states. □
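
The construction in the proof can be checked experimentally. The following Python sketch simulates A' on the fly for one small DFA and compares the result with a brute-force filtration of L. Everything concrete here is an assumption chosen for illustration: the example DFA (binary words whose number of 1's is divisible by 3), the helper names, and the length bounds. It is a sanity check under those assumptions, not a proof.

```python
from itertools import product

SIGMA = "01"
N_STATES = 3
START = 0
FINAL = {0}

def delta(q, c):
    # Hypothetical example DFA: number of 1's divisible by 3.
    return (q + 1) % N_STATES if c == "1" else q

# M_C[c][i][j] is True iff delta(q_i, c) = q_j; M is the entrywise OR of the M_c.
M_C = {c: [[delta(i, c) == j for j in range(N_STATES)] for i in range(N_STATES)]
       for c in SIGMA}
M = [[any(M_C[c][i][j] for c in SIGMA) for j in range(N_STATES)]
     for i in range(N_STATES)]

def vec_mat(v, A):
    """Boolean row vector times boolean matrix."""
    return tuple(any(v[i] and A[i][j] for i in range(N_STATES))
                 for j in range(N_STATES))

def skip(v, k):
    """Return v * M^k: advance over k unconstrained letters."""
    for _ in range(k):
        v = vec_mat(v, M)
    return v

def hits_final(v):
    return any(v[q] for q in FINAL)

def accepts_filtered(w, a, b):
    """Simulate A' on w; True iff w is in L_{a,b}."""
    v = tuple(q == START for q in range(N_STATES))
    if not w:
        # q'_0 is final iff L contains a word of length <= b.
        return any(hits_final(skip(v, k)) for k in range(b + 1))
    v = vec_mat(skip(v, b), M_C[w[0]])        # first transition: M^b M_c
    for c in w[1:]:
        v = vec_mat(skip(v, a - 1), M_C[c])   # later transitions: M^(a-1) M_c
    # Membership in T: some tail x_n of length 0..a-1 reaches a final state.
    return any(hits_final(skip(v, k)) for k in range(a))

def in_L(x):
    q = START
    for c in x:
        q = delta(q, c)
    return q in FINAL

def brute_force(a, b, max_len):
    """All words of L_{a,b} of length <= max_len, found by filtering words of L."""
    found = set()
    for n in range(a * max_len + b + 1):
        for letters in product(SIGMA, repeat=n):
            x = "".join(letters)
            if in_L(x):
                found.add(x[b::a])
    return found

def simulated(a, b, max_len):
    """All words of length <= max_len accepted by the simulated A'."""
    return {"".join(w) for n in range(max_len + 1)
            for w in product(SIGMA, repeat=n)
            if accepts_filtered("".join(w), a, b)}

for a, b in [(1, 0), (2, 0), (2, 1), (3, 2)]:
    print((a, b), simulated(a, b, 3) == brute_force(a, b, 3))
```

Note that the boolean vectors manipulated by `accepts_filtered` always have length `N_STATES`, whatever a and b are; this is the observation behind the finiteness claim.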

3 The context-free case 

Theorem 4. There exists a context-free language L such that filtering by the weak arithmetic 
progressions produces infinitely many distinct languages. 

Proof. Consider the language 

\[ L = \{1\, 0^n\, 2\, (0^+ 3)^n : n \geq 1\}. \]
Then it is easy to see that L is context-free, as it is generated by the context-free grammar

\[ \begin{array}{rcl} S & \to & 10AB \\ A & \to & 0AB \mid 2 \\ B & \to & 0B \mid 03 \end{array} \]
We claim that the languages $L_{a,0}$ for $a \geq 2$ are all distinct. To see this, it suffices to show
that the longest word in $L_{a,0} \cap 123^+$ is $123^{a-1}$.

Clearly $123^{a-1} = z_{a,0}$, where $z = 1\,0^{a-1}\,2\,(0^{a-1}3)^{a-1} \in L$.

Now suppose $x \in L_{a,0} \cap 123^+$. Then $x = w_{a,0}$ for some $w \in L$. Since each word in L starts with
$1\,0^n\,2$ and contains no other 2's, we must have $n = a - 1$. It follows that $w \in 1\,0^{a-1}\,2\,(0^+3)^{a-1}$.
But then w contains only $a - 1$ 3's, so x contains at most $a - 1$ 3's, with equality only if each
3 of w is used, that is, only if the exponent of 0 in each $0^+3$ is $a - 1$. Hence no word of
$L_{a,0} \cap 123^+$ is longer than $123^{a-1}$. (Shorter words can occur: for $a = 3$, the word
$w = 100200303 \in L$ gives $w_{3,0} = 123$.)

This completes the proof. □ 
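
As a sanity check on the argument above, the following Python sketch filters a bounded sample of L by the weak progressions with a = 2, ..., 5 and reports the longest word of the form $123^+$ obtained in each case. The bounds `max_n` and `max_exp` and the function names are assumptions made for illustration; a bounded search of this kind supports the claim but does not replace the proof.

```python
import re
from itertools import product

def words_of_L(max_n, max_exp):
    """Yield 1 0^n 2 (0^{e_1} 3)...(0^{e_n} 3) for 1 <= n <= max_n, 1 <= e_i <= max_exp."""
    for n in range(1, max_n + 1):
        for exps in product(range(1, max_exp + 1), repeat=n):
            yield "1" + "0" * n + "2" + "".join("0" * e + "3" for e in exps)

def filtered_123(a, max_n=4, max_exp=4):
    """Words of the form 123^+ obtained as w_{a,0} from the bounded sample of L."""
    return {w[::a] for w in words_of_L(max_n, max_exp)
            if re.fullmatch(r"123+", w[::a])}

for a in range(2, 6):
    hits = filtered_123(a)
    print(a, max(hits, key=len))  # longest hit is 12 followed by a-1 threes
```

Since the longest word differs for each a, the filtered languages in the sample are pairwise distinct.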

Theorem 5. There exists a context-free language L such that filtering by the shifts produces
infinitely many distinct languages.

Proof. Let $L = \{0^n 1^n : n \geq 0\}$. Then the languages $L_{1,b}$ are pairwise distinct, as for each
$b \geq 0$, the word $1^b$ is the longest word of the form $1^*$ in $L_{1,b}$. □
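
The claim about shifts is easy to test on a bounded sample of L. In this Python sketch the function name `longest_all_ones` and the bound `max_n` are illustrative assumptions; the shift by b is simply the slice that drops the first b letters.

```python
def longest_all_ones(b, max_n):
    """Longest word of the form 1^* in L_{1,b}, using 0^n 1^n for n <= max_n."""
    best = ""  # the empty word is always present, since n = 0 gives the empty word
    for n in range(max_n + 1):
        shifted = ("0" * n + "1" * n)[b:]   # drop the first b letters
        if "0" not in shifted and len(shifted) > len(best):
            best = shifted
    return best

for b in range(5):
    print(b, repr(longest_all_ones(b, max_n=10)))
```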



4 The operation diag 

Inspired by [2], which considered the transposition of words arranged into square arrays, we
introduce the following natural operation on words of length $n^2$ for some integer $n \geq 1$: we
arrange the letters of the word $w = a_0 a_1 \cdots a_{n^2-1}$ in row-major order in a square array,

\[ \begin{array}{cccc} a_0 & a_1 & \cdots & a_{n-1} \\ a_n & a_{n+1} & \cdots & a_{2n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n^2-n} & a_{n^2-n+1} & \cdots & a_{n^2-1} \end{array} \]

and then take the diagonal $a_0 a_{n+1} a_{2n+2} \cdots a_{n^2-1}$. We call the result diag(w). Thus, for example,
diag(absorbent) = art. Diagonals of matrices have long been studied in mathematics.
We extend diag to languages L as follows: 

\[ \mathrm{diag}(L) = \{\mathrm{diag}(w) : w \in L \text{ and there exists } n \geq 1 \text{ such that } |w| = n^2\}. \]
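
A minimal Python sketch of diag follows. Because the array is stored in row-major order, the diagonal letters sit at indices 0, n+1, 2(n+1), ..., so a slice with step n+1 extracts them. The function names and the choice of returning `None` for words whose length is not a positive square are assumptions made for illustration.

```python
from math import isqrt

def diag(w):
    """Return diag(w), or None if len(w) is not a positive perfect square."""
    n = isqrt(len(w))
    if n == 0 or n * n != len(w):
        return None
    return w[::n + 1]

def diag_language(words):
    """Return diag(L) for a finite set of words standing in for L."""
    return {d for d in map(diag, words) if d is not None}

print(diag("absorbent"))                          # art
print(sorted(diag_language({"absorbent", "ab", "x"})))
```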

Theorem 6. If L is regular then so is diag(L). 

Proof. Given a DFA $A = (Q, \Sigma, \delta, q_0, F)$ accepting L, we construct an NFA $A' = (Q', \Sigma, \delta', q'_0, F')$
accepting diag(L). As in the proof of Theorem 2, we let $M_c$ be the $n \times n$ boolean incidence
matrix of the underlying transition graph of the automaton corresponding to a transition on
the symbol $c \in \Sigma$, and we define $M = \bigvee_{c \in \Sigma} M_c$.

The idea is that $w = a_1 \cdots a_t \in L(A')$ if and only if there exists $x \in L(A)$ such that
$x = a_1 x_1 \cdots a_{t-1} x_{t-1} a_t$ where $|x_i| = t$ for $1 \leq i < t$.

The states of A' are of the form [v, V, W] where v is a length-n boolean vector and V and
W are $n \times n$ boolean matrices. Let $i = [1\ \overbrace{0 \cdots 0}^{n-1}]$ and f be the boolean vector corresponding
to the final states of A. The transitions of A' are given by

\[ \delta'(q'_0, c) = \{[i \cdot M_c, M, X] : \exists k \geq 0 \text{ such that } X = M^k\} \]
\[ \delta'([v, V, W], a) = \{[v W M_a, V M, W]\}, \]

for all symbols $a, c \in \Sigma$, boolean vectors v, and boolean matrices V, W. The final states of A' are

\[ F' = \{[v, V, W] : v \cdot f = 1 \text{ and } V = W\}. \]

We leave it to the reader to verify that L(A') = diag(L). □ 
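
The matrix computation underlying A' can be tested on a small example. The Python sketch below does not build the NFA: instead of guessing W nondeterministically, it uses $W = M^t$ directly, where t is the length of the input, which is the value the condition V = W enforces. The example DFA (for $a^*b^*$, with a dead state) and all helper names are assumptions chosen for illustration.

```python
from itertools import product

SIGMA = "ab"
STATES = range(3)
START, FINAL = 0, {0, 1}
# Hypothetical example DFA for a*b*; state 2 is the dead state.
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 2,
         (1, "b"): 1, (2, "a"): 2, (2, "b"): 2}

M_C = {c: [[DELTA[i, c] == j for j in STATES] for i in STATES] for c in SIGMA}
M = [[any(M_C[c][i][j] for c in SIGMA) for j in STATES] for i in STATES]

def vec_mat(v, A):
    """Boolean row vector times boolean matrix."""
    return tuple(any(v[i] and A[i][j] for i in STATES) for j in STATES)

def in_diag_L(w):
    """True iff w = a_1...a_t is the diagonal of some x in L with |x| = t^2."""
    t = len(w)
    if t == 0:
        return False
    v = vec_mat(tuple(q == START for q in STATES), M_C[w[0]])  # i * M_c
    for c in w[1:]:
        for _ in range(t):       # multiply by W = M^t: skip the t letters of x_i
            v = vec_mat(v, M)
        v = vec_mat(v, M_C[c])   # then read the next diagonal letter
    return any(v[q] for q in FINAL)

def in_L(x):
    q = START
    for c in x:
        q = DELTA[q, c]
    return q in FINAL

def brute_force_diag(max_t):
    """Diagonals of all words of L of length t^2, for 1 <= t <= max_t."""
    found = set()
    for t in range(1, max_t + 1):
        for letters in product(SIGMA, repeat=t * t):
            x = "".join(letters)
            if in_L(x):
                found.add(x[::t + 1])
    return found

via_matrices = {"".join(w) for t in range(1, 4)
                for w in product(SIGMA, repeat=t) if in_diag_L("".join(w))}
print(via_matrices == brute_force_diag(3))
```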

Theorem 7. There exists a context-free language L such that diag(L) is not context-free. 

Proof. For expository reasons, our example is over the alphabet {a, b, c, d, e, f, g, h, i, j, 0} of 
11 letters, although it is easy to reduce this. 
Consider 

\[ L = \{a\,0^{3m+1}\,b\,(0^+c)^{m-2}\,0^+\,d\,0^{3n+1}\,e\,(0^+f)^{n-2}\,0^+\,g\,0^{3p+1}\,h\,(0^+i)^{p-2}\,0^+\,j : m, n, p \geq 3\}. \]

It is clear that L is context-free, as it is the concatenation $L_1 L_2 L_3$ of the three languages

\[ L_1 = \{a\,0^{3m+1}\,b\,(0^+c)^{m-2}\,0^+ : m \geq 3\} \]
\[ L_2 = \{d\,0^{3n+1}\,e\,(0^+f)^{n-2}\,0^+ : n \geq 3\} \]
\[ L_3 = \{g\,0^{3p+1}\,h\,(0^+i)^{p-2}\,0^+\,j : p \geq 3\} \]

each of which is easily seen to be context-free. 

We will show that diag(L) is not context-free by showing that 

\[ L' := \mathrm{diag}(L) \cap a\,b\,c^+\,d\,e\,f^+\,g\,h\,i^+\,j \]

is not context-free. 

We claim that $L' = \{a b c^t d e f^t g h i^t j : t \geq 1\}$. It is easy to see that every word of the
form $a b c^t d e f^t g h i^t j$ for $t \geq 1$ is in L', since we can take m = n = p = t + 2, and the exponent
of 0 in each $0^+$ term to be 3m + 1.

It remains to see that these are the only words of the form $a b c^+ d e f^+ g h i^+ j$ in L'. Let
$x \in L'$, and let $y \in L$ be such that x = diag(y). Then since the first two symbols of x
must be ab, and since they are separated by 3m + 1 0's for some $m \geq 3$, it must be that
$|y| = (3m + 1)^2$. Then $|x| = 3m + 1$. We can repeat the argument with the letters d, e and
g, h to get m = n = p. Removing the single occurrence of each letter a, b, d, e, g, h, j from x
leaves 3m - 6 letters, which must be chosen from {c, f, i}. But there are only m - 2 possible
occurrences of each of the letters c, f, i in y, so each occurrence of these letters must appear
on the diagonal of y to get x. Then these letters must be separated by 3m + 1 0's. Thus
$x = a b c^{m-2} d e f^{m-2} g h i^{m-2} j$.

Now an easy argument from the pumping lemma shows that L' is not context-free. Since the
context-free languages are closed under intersection with regular languages, it follows that
diag(L) is not context-free. □
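
The first half of the claim about L' (that every word $a b c^t d e f^t g h i^t j$ arises as a diagonal) can be checked concretely. The Python sketch below builds the witness word y with m = n = p = t + 2 and every run of 0's of length 3m + 1, confirms with a regular expression that y has the shape required by L for that m, and takes its diagonal. The function names are hypothetical and the regular expression covers only the case m = n = p.

```python
import re
from math import isqrt

def witness(t):
    """Build y with m = n = p = t + 2 and every 0-run of length 3m + 1."""
    m = t + 2
    gap = "0" * (3 * m + 1)
    def block(first, second, repeated):
        # first 0^{3m+1} second (0^{3m+1} repeated)^{m-2} 0^{3m+1}
        return first + gap + second + (gap + repeated) * (m - 2) + gap
    return block("a", "b", "c") + block("d", "e", "f") + block("g", "h", "i") + "j"

def has_shape_of_L(y, m):
    """Check y against the definition of L in the special case m = n = p."""
    k, r = 3 * m + 1, m - 2
    pattern = (f"a0{{{k}}}b(0+c){{{r}}}0+"
               f"d0{{{k}}}e(0+f){{{r}}}0+"
               f"g0{{{k}}}h(0+i){{{r}}}0+j")
    return re.fullmatch(pattern, y) is not None

def diag(w):
    """Diagonal of w arranged row by row in a square array (None if impossible)."""
    n = isqrt(len(w))
    return w[::n + 1] if n > 0 and n * n == len(w) else None

for t in range(1, 4):
    y = witness(t)
    print(t, len(y), has_shape_of_L(y, t + 2), diag(y))
```

For t = 1 the witness has length $10^2 = 100$ and its diagonal is abcdefghij.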